Catalan-English Statistical Machine Translation without Parallel Corpus: Bridging through Spanish
Abstract
This paper presents a full experiment on large-vocabulary Catalan-English statistical machine translation (SMT) without an English-Catalan parallel corpus, in the context of the debates of the European Parliament. To this end, we use an English-Spanish European Parliament Proceedings parallel corpus and a Spanish-Catalan general newspaper parallel corpus, each of more than 30 M words. Given the linguistic proximity of Spanish and Catalan, we investigate the cost of using Spanish as a bridge towards large-vocabulary Catalan-English translation in a fully automatic statistical machine translation framework. Experimental results are promising: the achieved translation quality is nearly equivalent to that of the Spanish-English language pair, which practically brings SMT research for Catalan to the level of languages that are more prominent in terms of data availability.
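The abstract does not say how the Spanish bridge is implemented. One common way to bridge two languages is a cascade: translate Catalan into Spanish, then Spanish into English. The sketch below illustrates only that composition. The names `make_word_translator` and `bridge` and the tiny lexicons are hypothetical stand-ins for trained SMT systems, not the authors' method.

```python
from typing import Callable, Dict

# A translator is any function mapping a source sentence to a target sentence.
Translator = Callable[[str], str]


def make_word_translator(lexicon: Dict[str, str]) -> Translator:
    """Toy word-by-word translator (hypothetical stand-in for an SMT system).

    Words missing from the lexicon are passed through unchanged.
    No reordering is performed.
    """
    def translate(sentence: str) -> str:
        return " ".join(lexicon.get(word, word) for word in sentence.split())
    return translate


def bridge(src_to_pivot: Translator, pivot_to_tgt: Translator) -> Translator:
    """Compose two translators through a pivot language (cascade)."""
    def translate(sentence: str) -> str:
        return pivot_to_tgt(src_to_pivot(sentence))
    return translate


# Illustrative lexicons only; a real system would be trained on the
# Spanish-Catalan and English-Spanish parallel corpora.
ca_es = make_word_translator(
    {"el": "el", "parlament": "parlamento", "europeu": "europeo"}
)
es_en = make_word_translator(
    {"el": "the", "parlamento": "parliament", "europeo": "european"}
)

ca_en = bridge(ca_es, es_en)
print(ca_en("el parlament europeu"))  # word-by-word output, no reordering
```

In a cascade, errors made in the first step are passed on to the second, which is one source of the bridging cost that the abstract sets out to measure.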
Related papers
Catalan-English statistical machine translation without a parallel corpus
This paper presents a full experiment on large-vocabulary Catalan-English statistical machine translation without an English-Catalan parallel corpus, in the context of the debates of the European Parliament. For this, we make use of an English-Spanish European Parliament Proceedings parallel corpus and a Spanish-Catalan general newspaper parallel corpus, each of more than 30 M words. G...
A Large Spanish-Catalan Parallel Corpus Release for Machine Translation
We present a large Spanish-Catalan parallel corpus extracted from ten years of the paper edition of a bilingual Catalan newspaper. The produced corpus of 7.5 M parallel sentences (around 180 M words per language) is useful for many natural language applications. We report excellent results when building a statistical machine translation system trained on this parallel corpus. The Spanish-Catala...
Development of Language Resources for Speech-to-speech Translation
This paper describes the creation of linguistically enriched aligned corpora for Catalan, Spanish and US-English for Speech-to-Speech Translation. These corpora are obtained from two different sources: US-English transcribed speech data and transcriptions of conversations recorded in Catalan and Spanish. After human translation, a large trilingual spontaneous speech corpus has been obtained. Thi...
Towards the Use of Word Stems and Suffixes for Statistical Machine Translation
In this paper we present methods for improving the quality of translation from an inflected language into English by making use of part-of-speech tags and word stems and suffixes in the source language. Results for translations from Spanish and Catalan into English are presented on the LC-STAR trilingual corpus which consists of spontaneously spoken dialogues in the domain of travelling and app...
Persian-Spanish Low-Resource Statistical Machine Translation Through English as Pivot Language
This paper is an attempt to exclusively focus on investigating the pivot language technique in which a bridging language is utilized to increase the quality of the Persian–Spanish low-resource Statistical Machine Translation (SMT). In this case, English is used as the bridging language, and the Persian–English SMT is combined with the English–Spanish one, where the relatively large corpora of e...
Publication date: 2006